Introduction to Python for NLP

Programming in Python

  • Basic Data Types & Operations
    • Arithmetic
    • Variable Assignment
    • Strings
    • Lists
  • A Few Tricks Up Our Sleeve
    • String Methods
    • List Comprehension

0. Building Intuition

Inside the folder that contains this script, there is a plaintext file with notes that I took during Professor Emily Thornbury's talk "Stop Having Ideas and Start Counting." The title contains two -ing verb forms, which might suggest that we look for others in the body of the text if we were doing something like a literary analysis.

Part of the reason people use Python to work on human-language texts (natural language processing) is that it makes tasks like this relatively simple.


In [ ]:
# (don't worry about understanding everything here)

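# 'with' makes sure the file is closed again once we are done reading it
with open('lecture notes 09-22-15.txt') as notes_file:
    for line in notes_file:
        for word in line.split():
            if word.endswith('ing'):
                print(word)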

1. Basic Data Types & Operations

Arithmetic

Before doing any more NLP, let's start with the basics. Any time you work with computers, it is essential to remember that they are simply counting machines. Placeholders holding zeros and ones represent numbers that get added to and subtracted from one another. This is true even for representations of language, but let's not get ahead of ourselves.
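
As a small side illustration of the "zeros and ones" idea, Python's built-in bin function shows the binary digits behind a whole number, and int can read such digits back in. This is only a sketch; the variable names are made up for the example.

```python
# Show the binary ("zeros and ones") form of two small whole numbers
two_in_binary = bin(2)
five_in_binary = bin(5)
print(two_in_binary, five_in_binary)   # prints 0b10 0b101

# Read a string of binary digits back in as an ordinary number
seven = int('111', 2)
print(seven)   # prints 7

# Addition gives the same result however the numbers are written out
print(bin(2 + 5))   # prints 0b111
```

The '0b' at the front is just Python's label for "what follows is binary."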


In [ ]:
# Addition

2+5

In [ ]:
# Let's have Python report the results from three operations at the same time

print(2-5)
print(2*5)
print(2/5)

In [ ]:
# If we put all of our operations on the last line of the cell, separated by commas, Jupyter displays the results together

2-5, 2*5, 2/5

In [ ]:
# And let's compare values

2>5

Variable Assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where x represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
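
One way to see the "arrow" idea is to re-use a label on both sides of the equals sign. The sketch below uses a made-up variable called count; as algebra the second line would be nonsense, but as an instruction it works fine.

```python
# The right-hand side is worked out first, then stored under the label on the left
count = 2
count = count + 5   # take the old value (2), add 5, and store the result under 'count' again
print(count)        # prints 7

# The arrow only points one way: writing '2 = count' is a SyntaxError
```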


In [ ]:
# 'a' is being given the value 2; 'b' is given 5

a = 2
b = 5

In [ ]:
# Let's perform an operation on the variables

a+b

In [ ]:
# Variables can have many different kinds of names

this_number = 2
b/this_number

Strings

In Python, human-language text is represented as a string. A string is a sequence of characters set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
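
To get a rough sense of what "numerical representations" means, the built-in functions ord and chr convert between a single character and the number (its Unicode code point) that stands for it. The variable name word below is just for illustration.

```python
# Each character corresponds to a number
print(ord('H'))   # prints 72
print(chr(72))    # prints H

# Walk through a string and show the number behind each character
word = "Hello"
for character in word:
    print(character, ord(character))
```

Upper- and lower-case letters get different numbers, which is one reason the computer treats 'H' and 'h' as different characters.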


In [ ]:
# The iconic string

print("Hello, World!")

In [ ]:
# Assign these strings to variables

a = "Hello"
b = 'World'

In [ ]:
# Try out arithmetic operations.
# When we add strings we call it 'concatenation'

print(a+b)
print(a*5)

In [ ]:
# Unlike a number that consists of a single value, a string is a
# sequence of characters. We can find out the length of that sequence.

len("Hello, World!")

In [ ]:
## EX. How long is the string below?

this_string = "It was the best of times; it was the worst of times."

Lists

The numbers and strings we have just looked at are the two basic data types that we will focus our attention on in this workshop. (In a few days, we will look at a third data type, Boolean, which consists of True/False values.) When we are working with just a few numbers or strings, it is easy to keep track of them, but as we collect more we will want a system to organize them.

One such organizational system is a list. This contains values (regardless of type) in order, and we can perform operations on it very similarly to the way we did with numbers.


In [ ]:
# A list in which each element is a string

['Call', 'me', 'Ishmael']

In [ ]:
# Let's assign a couple of lists to variables

list1 = ['Call', 'me', 'Ishmael']
list2 = ['In', 'the', 'beginning']

In [ ]:
## Q. Predict what will happen when we perform the following operations

print(list1+list2)
print(list1*5)

In [ ]:
# As with a string, we can find out the length of a list

len(list1)

In [ ]:
# Sometimes we just want a single value from the list at a time

list1[0]

In [ ]:
list1[1]

In [ ]:
list1[2]

In [ ]:
# Or maybe we want the first few

list1[0:2]

In [ ]:
list1[:2]

In [ ]:
# Of course, lists can contain numbers or even a mix of numbers and strings

list3 = [7,8,9]
list4 = [7,'ate',9]

In [ ]:
# And Python is smart with numbers, so we can add them easily!

sum(list3)

In [ ]:
## EX. Concatenate 'list1' and 'list2' into a single list.
##     Retrieve the third element from the combined list.
##     Retrieve the fourth through sixth elements from the combined list.

2. A Few Tricks Up Our Sleeve

String Methods

The creators of Python recognize that human language has many important yet idiosyncratic features, so they have tried to make it easy for us to identify and manipulate them. For example, in the demonstration at the very beginning of the workshop, we referred to the idea of the suffix: the final letters of a word tell us something about its grammatical role and potentially the author's argument.

We can analyze or manipulate certain features of a string using its methods. These are basically built-in functions that every string automatically possesses. Note that even though a method may return a transformed version of the string, it doesn't change the original string!


In [ ]:
# Let's assign a variable to perform methods upon

greeting = "Hello, World!"

In [ ]:
# We saw the 'endswith' method at the very beginning
# Note the type of output that gets printed

greeting.startswith('H'), greeting.endswith('d')

In [ ]:
# We can check whether the string is a letter or a number

this_string = 'f'

this_string.isalpha()

In [ ]:
# When there are multiple characters, it checks whether *all*
# of the characters belong to that category

greeting.isalpha(), greeting.isdigit()

In [ ]:
# Similarly, we can check whether the string is lower or upper case

greeting.islower(), greeting.isupper(), greeting.istitle()

In [ ]:
# Sometimes we want not just to check, but to change the string

greeting.lower(), greeting.upper()

In [ ]:
# The case of the string hasn't changed!

greeting

In [ ]:
# But if we want to permanently make it lower case we re-assign it

greeting = greeting.lower()

In [ ]:
greeting

In [ ]:
# Oh hey. And strings are kind of like lists, so we can slice them similarly

greeting[:3]

In [ ]:
# Strings may be like lists of characters, but as humans we often treat them as
# lists of words. We can tell the computer to divide a string into word-like units.

greeting.split()

In [ ]:
## EX. Return the second through eighth characters in 'greeting'

## EX. Split the string below into a list of words and assign this to a new variable
## Note: A backslash at the end of a line allows a string to continue onto the next one unbroken

In [ ]:
new_string = "It is a truth universally acknowledged, that a single \
man in possession of a good fortune must be in want of a wife."

List Comprehension

List comprehensions are a fairly advanced programming technique that we will spend more time talking about tomorrow. For now, you can think of them as list filters. Often, we don't need every value in a list, just a few that fulfill certain criteria.
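
To make the "filter" idea concrete, here is a rough sketch that writes the same filter two ways: first as an ordinary loop, then as a list comprehension. The variable names are made up for the example, and append (a list method not covered above) adds one item to the end of a list.

```python
words = ['Call', 'me', 'Ishmael']   # same contents as 'list1' above

# The long way: start with an empty list and add matching words one at a time
lowercase_loop = []
for word in words:
    if word.islower():
        lowercase_loop.append(word)

# The same filter written as a list comprehension
lowercase_comprehension = [word for word in words if word.islower()]

print(lowercase_loop)            # prints ['me']
print(lowercase_comprehension)   # prints ['me']
```

Both versions read the list from left to right and keep only the words that pass the test after "if"; the comprehension simply fits it on one line.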


In [ ]:
# 'list1' contains three words, two of which are in title case.
# We can automatically return those words using a list comprehension

[word for word in list1 if word.istitle()]

In [ ]:
# Or we can use all the words in the list but just take their first letters

[word[0] for word in list1]

In [ ]:
## EX. Using the list of words you produced by splitting 'new_string', create
##     a new list that contains only the words whose last letter is "e" 

## EX. Create a new list that contains the first letter of each word.

## EX. Create a new list that contains only words longer than two letters.

3. Exploratory NLP Tasks

Now that we have some of Python's basics in our toolkit, we can immediately perform the kinds of tasks that are the digital humanist's bread and butter. When we first meet a text in the wild, we often wish to find out a little about it before digging in deeply, so we start with simple questions like "How many words are in this text?" or "How long is the average word?"
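
As one possible approach, here is a sketch of both questions on a three-word sample rather than a full text. The sample string and variable names are stand-ins made up for the example.

```python
sample_text = "Call me Ishmael"   # stand-in for a much longer text

# Split on whitespace to get word-like units, then count them
sample_words = sample_text.split()
word_count = len(sample_words)

# Average word length: total characters across the words, divided by the number of words
word_lengths = [len(word) for word in sample_words]
average_length = sum(word_lengths) / word_count

print(word_count)       # prints 3
print(average_length)   # 13 / 3, a little over 4
```

Because split only divides on whitespace, any punctuation stays attached to its word in a real text, so counts and lengths obtained this way are approximate.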


In [ ]:
##     Run the cell below to read in the text of "Pride and Prejudice."

##     How many words are in the novel?
##     How many words in the novel appear in title case?
##     Approximately how long is the average word in the novel?

In [ ]:
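# 'with' makes sure the file is closed again once the text has been read
with open('Austen - Pride and Prejudice.txt') as austen_file:
    austen_string = austen_file.read()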